{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9UBCYU0NaobY"
   },
   "source": [
    "# 求近义词和类比词\n",
    "\n",
    "在[“word2vec的实现”](./word2vec-nn.ipynb)一节中，我们在小规模数据集上训练了一个word2vec词嵌入模型，并通过词向量的余弦相似度搜索近义词。实际中，在大规模语料上预训练的词向量常常可以应用到下游自然语言处理任务中。本节将演示如何用这些预训练的词向量来求近义词和类比词。我们还将在后面两节中继续应用预训练的词向量。\n",
    "\n",
    "## 使用预训练的词向量\n",
    "\n",
    "`torchtext`包提供了跟自然语言处理相关的函数和类（更多参见torchtext文档 [1]）。下面查看它目前提供的预训练词嵌入的简称。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 55
    },
    "colab_type": "code",
    "id": "l6DdnU4Taoba",
    "outputId": "af985cfa-d251-4368-fd48-a48a8ffe2c7e"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['charngram.100d', 'fasttext.en.300d', 'fasttext.simple.300d', 'glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d'])"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "from torchtext import vocab\n",
    "\n",
    "vocab.pretrained_aliases.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "gCAQ6kW6aobk"
   },
   "source": [
    "给定词嵌入名称，可以查看该词嵌入调用的对象及参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "OnOuWVDmaobk",
    "outputId": "1de8b951-4421-4fc6-bce3-ae4c12843c1a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "functools.partial(<class 'torchtext.vocab.GloVe'>, name='6B', dim='50')\n"
     ]
    }
   ],
   "source": [
    "print(vocab.pretrained_aliases['glove.6B.50d'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dYEU7Rcpaobo"
   },
   "source": [
    "torchtext中预训练的GloVe模型的命名规范大致是“模型.（数据集.）数据集词数.词向量维度”。更多信息可以参考GloVe和fastText的项目网站 [2,3]。下面我们使用基于维基百科子集预训练的50维GloVe词向量。第一次创建预训练词向量实例时会自动下载相应的词向量，因此需要联网。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "XA1ONrC4aobp"
   },
   "outputs": [],
   "source": [
    "glove_6b50d = vocab.GloVe(name='6B', dim=50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tg2CQ6opaobz"
   },
   "source": [
    "打印词典大小。其中含有40万个词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "OZ_Ey0Yvaob1",
    "outputId": "bdb2cac5-d67f-490f-c909-e1ee59795415"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "400000"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(glove_6b50d.vectors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "S-fuHeM3aob7"
   },
   "source": [
    "我们可以通过词来获取它在词典中的索引，也可以通过索引获取词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "uFDcHbWZaob9",
    "outputId": "9b6fbf12-af4d-408f-9a91-4260d1d4986b"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3366, 'beautiful')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "glove_6b50d.stoi['beautiful'], glove_6b50d.itos[3366]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "EDrsh9cmaocG"
   },
   "source": [
    "## 应用预训练词向量\n",
    "\n",
    "下面我们以GloVe模型为例，展示预训练词向量的应用。\n",
    "\n",
    "### 求近义词\n",
    "\n",
    "这里重新实现[“word2vec的实现”](./word2vec-nn.ipynb)一节中介绍过的使用余弦相似度来搜索近义词的算法。为了在求类比词时重用其中的求$k$近邻（$k$-nearest neighbors）的逻辑，我们将这部分逻辑单独封装在`knn`函数中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "0esUCDSiaocH"
   },
   "outputs": [],
   "source": [
    "def knn(W, x, k):\n",
    "    # 添加的1e-9是为了数值稳定性\n",
    "    cos = torch.mm(W, x.reshape(-1, 1)) / (\n",
    "        (torch.sum(W * W, dim=1, keepdim=True) + 1e-9).sqrt() * torch.sum(x * x).sqrt())\n",
    "    _, topk = torch.topk(cos, k=k, dim=0)\n",
    "    return topk, [cos[i].item() for i in topk]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "YGw9G_44aocM"
   },
   "source": [
    "然后，我们通过预训练词向量实例`embed`来搜索近义词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "SJ_rAiYeaocN"
   },
   "outputs": [],
   "source": [
    "def get_similar_tokens(query_token, k, embed):\n",
    "    topk, cos = knn(embed.vectors,\n",
    "                    embed[query_token], k+1)\n",
    "    for i, c in zip(topk[1:], cos[1:]):  # 除去输入词\n",
    "        print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oFIUoZJhaocS"
   },
   "source": [
    "已创建的预训练词向量实例`glove_6b50d`的词典中含40万个词和1个特殊的未知词。除去输入词和未知词，我们从中搜索与“chip”语义最相近的3个词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 89
    },
    "colab_type": "code",
    "id": "nBq8NoQ8aocV",
    "outputId": "40b4906a-747b-4439-f76b-40e9d11b0d20"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cosine sim=0.856: chips\n",
      "cosine sim=0.749: intel\n",
      "cosine sim=0.749: electronics\n"
     ]
    }
   ],
   "source": [
    "get_similar_tokens('chip', 3, glove_6b50d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ykcQ4ws-aoce"
   },
   "source": [
    "接下来查找“baby”和“beautiful”的近义词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 89
    },
    "colab_type": "code",
    "id": "oIshpLkheYkn",
    "outputId": "a0759dd9-9208-435f-9106-fa353f2125a8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cosine sim=0.839: babies\n",
      "cosine sim=0.800: boy\n",
      "cosine sim=0.792: girl\n"
     ]
    }
   ],
   "source": [
    "get_similar_tokens('baby', 3, glove_6b50d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 89
    },
    "colab_type": "code",
    "id": "hKE6lk-oebrd",
    "outputId": "a80d0de6-d012-4001-e9c2-f9ef6f18d139"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cosine sim=0.921: lovely\n",
      "cosine sim=0.893: gorgeous\n",
      "cosine sim=0.830: wonderful\n"
     ]
    }
   ],
   "source": [
    "get_similar_tokens('beautiful', 3, glove_6b50d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "qPHSxqYFaoco"
   },
   "source": [
    "### 求类比词\n",
    "\n",
    "除了求近义词以外，我们还可以使用预训练词向量求词与词之间的类比关系。例如，“man”（男人）: “woman”（女人）:: “son”（儿子） : “daughter”（女儿）是一个类比例子：“man”之于“woman”相当于“son”之于“daughter”。求类比词问题可以定义为：对于类比关系中的4个词 $a : b :: c : d$，给定前3个词$a$、$b$和$c$，求$d$。设词$w$的词向量为$\\text{vec}(w)$。求类比词的思路是，搜索与$\\text{vec}(c)+\\text{vec}(b)-\\text{vec}(a)$的结果向量最相似的词向量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "uzagCt7faoco"
   },
   "outputs": [],
   "source": [
    "def get_analogy(token_a, token_b, token_c, embed):\n",
    "    vecs_a, vecs_b, vecs_c = embed[token_a], embed[token_b], embed[token_c]\n",
    "    x = vecs_b - vecs_a + vecs_c\n",
    "    topk, cos = knn(embed.vectors, x, 1)\n",
    "    return embed.itos[topk[0]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tdTKEK2Xaocs"
   },
   "source": [
    "验证一下“男-女”类比。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "18"
    },
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "tn3K_G0Oaocs",
    "outputId": "fca1c7cc-e4e5-4fb3-d74e-6380163a9220"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'daughter'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_analogy('man', 'woman', 'son', glove_6b50d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "2D07IjNwaocv"
   },
   "source": [
    "“首都-国家”类比：“beijing”（北京）之于“china”（中国）相当于“tokyo”（东京）之于什么？答案应该是“japan”（日本）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "19"
    },
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "0DOGTsB-aocv",
    "outputId": "55e4a52e-5d41-4e79-9871-bdd90c8b2b03"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'japan'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_analogy('beijing', 'china', 'tokyo', glove_6b50d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SUNdVC7Yaoc0"
   },
   "source": [
    "“形容词-形容词最高级”类比：“bad”（坏的）之于“worst”（最坏的）相当于“big”（大的）之于什么？答案应该是“biggest”（最大的）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "20"
    },
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "cuO_M6Fpaoc0",
    "outputId": "7ab2dc9f-d2f4-4745-e88a-083fa8be5351"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'biggest'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_analogy('bad', 'worst', 'big', glove_6b50d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "05prNHluaoc3"
   },
   "source": [
    "“动词一般时-动词过去时”类比：“do”（做）之于“did”（做过）相当于“go”（去）之于什么？答案应该是“went”（去过）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "21"
    },
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "SI77LtVfaoc3",
    "outputId": "20e70470-4ee0-440e-fd19-3e38a710e2eb"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'went'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_analogy('do', 'did', 'go', glove_6b50d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "_Y5Arx-Aaoc6"
   },
   "source": [
    "## 小结\n",
    "\n",
    "* 在大规模语料上预训练的词向量常常可以应用于下游自然语言处理任务中。\n",
    "* 可以应用预训练的词向量求近义词和类比词。\n",
    "\n",
    "\n",
    "## 练习\n",
    "\n",
    "* 测试一下fastText的结果。值得一提的是，fastText有预训练的中文词向量（`pretrained_file_name='wiki.zh.vec'`）。\n",
    "* 如果词典特别大，如何提升近义词或类比词的搜索速度？\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## 参考文献\n",
    "\n",
    "[1] torchtext工具包。 https://torchtext.readthedocs.io/en/latest/index.html\n",
    "\n",
    "[2] GloVe项目网站。 https://nlp.stanford.edu/projects/glove/\n",
    "\n",
    "[3] fastText项目网站。 https://fasttext.cc/\n",
    "\n",
    "## 扫码直达[讨论区](https://discuss.gluon.ai/t/topic/4373)\n",
    "\n",
    "![](../img/qr_similarity-analogy.svg)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "similarity-analogy.ipynb",
   "provenance": [],
   "toc_visible": true,
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python [conda env:pytorch]",
   "language": "python",
   "name": "conda-env-pytorch-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
